Loading the MASS “Boston” dataset already loaded in R. Link to dataset
This is a data set about Housing Values in Suburbs of Boston.

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. < em >J. Environ. Economics and Management < b >5, 81???102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) < em >Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley

Variables in the dataset:

  1. crim= per capita crime rate by town.
  2. zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus = proportion of non-retail business acres per town.
  4. chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. nox = nitrogen oxides concentration (parts per 10 million).
  6. rm = average number of rooms per dwelling.
  7. age = proportion of owner-occupied units built prior to 1940.
  8. dis = weighted mean of distances to five Boston employment centres.
  9. rad = index of accessibility to radial highways.
  10. tax = full-value property-tax rate per $10,000.
  11. ptratio = pupil-teacher ratio by town.
  12. black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. lstat = lower status of the population (percent).
  14. medv = median value of owner-occupied homes in $1000s.
library(MASS) 
data("Boston")
dim(Boston) #506 observation of 14 variables
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Names of the variables in the data
506 observation of 14 variables

gather(Boston) %>% 
  ggplot(aes(value)) + facet_wrap("key", scales= "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualisation of the variables. the variable rm is the only one looking normally distribiuted, the others seem rather skewed.

Correlation plots

library(tidyr)
library("ggplot2")
library("GGally")
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library("corrplot")
## corrplot 0.84 loaded
cor_matrix <- cor(Boston)%>%round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
p1 <- ggcorr(cor_matrix,geom="circle")
# correlation plot matrix 
p1

?ggcorr
p2<- corrplot(cor_matrix, method = "circle",type="upper",cl.pos="b",tl.pos="d",tl.cex = 0.6)

p2
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Looking at the correlations plot we can see that positively correlated variables are; rad (index of accesibility to radial highways), lstat (lower status of the population, percent) tax (full-value property-tax rate per $10,000) to crim (our variable of interest is crimerate per capita by town). Negatively associated with crim are medv (median value of owner-occupied homes in $1000s) and dis (weighted mean of distances to five Boston employment centres).

Scaling the dataset

scaled_boston <- scale(Boston)
summary(scaled_boston)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
summary(scaled_boston$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
crim_vector <- quantile(scaled_boston$crim) #new crime variable
crim_vector
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <-cut(scaled_boston$crim, breaks = crim_vector, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
scaled_boston <- dplyr::select(scaled_boston, -crim)
scaled_boston <- data.frame(scaled_boston,crime)
n <- nrow(scaled_boston)
n
## [1] 506
train <- sample(n, size = n * 0.8)
train_set <- scaled_boston[train,] # training set with 80% of obs.
dim(train_set)
## [1] 404  14
test <- scaled_boston [-train,] # test set with 20% of obs.
dim(test)
## [1] 102  14
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Linear discriminant analysis

lda1<-lda(crime~., data=train_set)
lda1
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2524752 0.2376238 0.2500000 
## 
## Group means:
##                  zn      indus         chas        nox         rm
## low       1.0255060 -0.9704232 -0.084848104 -0.9033402  0.4853065
## med_low  -0.1402296 -0.2150198 -0.002135914 -0.5445225 -0.1693487
## med_high -0.3791541  0.2207722  0.342842844  0.4184796  0.1877371
## high     -0.4872402  1.0171306 -0.155385496  1.0156493 -0.3928745
##                 age        dis        rad        tax     ptratio
## low      -0.9127882  0.9197667 -0.6865511 -0.7229099 -0.48447729
## med_low  -0.3031989  0.2775597 -0.5585147 -0.4459476 -0.06595478
## med_high  0.4419249 -0.4235225 -0.3741407 -0.2850365 -0.45339490
## high      0.7969423 -0.8480999  1.6379981  1.5139626  0.78062517
##                black       lstat       medv
## low       0.37655659 -0.77203627  0.5753937
## med_low   0.30928750 -0.05814543 -0.0462062
## med_high  0.07180179 -0.01150049  0.2550062
## high     -0.74199322  0.97553322 -0.7596165
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09018311  0.768620075 -0.76721089
## indus    0.05899202 -0.406226484  0.35021226
## chas    -0.09859431 -0.103881331 -0.01074262
## nox      0.35974494 -0.549198616 -1.38752235
## rm      -0.10804618 -0.091099853 -0.10700659
## age      0.29884902 -0.287099988 -0.23657261
## dis     -0.04519730 -0.228077340 -0.04805085
## rad      3.09585371  0.768304680 -0.16273782
## tax     -0.10581076  0.071867706  0.48226534
## ptratio  0.05227493  0.180134604 -0.18895075
## black   -0.13696293 -0.005543041  0.11817477
## lstat    0.21483170 -0.139471695  0.39260667
## medv     0.16769680 -0.295003538 -0.29847084
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9371 0.0459 0.0170
classes <- as.numeric(train_set$crime) # 1,2,3,4 in stead of low,med_low,med_high, high

LDA biplot

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda1, myscale = 1)

##Predict with LDA

# predict classes with test data
lda.pred <- predict(lda1, newdata = test)
# cross tabulate the results
table(correct = lda.pred$class, prediction = correct_classes)%>%addmargins()
##           prediction
## correct    low med_low med_high high Sum
##   low        9       5        1    0  15
##   med_low   11      19       15    0  45
##   med_high   2       0       14    0  16
##   high       0       0        0   26  26
##   Sum       22      24       30   26 102

Prediction was right in 56% of test set
high crime was prdicted in 16/18 cases
med_high crime was predicted in 18/24
high crime was predicted in 26/27

Clustering

library(MASS)
data("Boston")
scaled_boston <- scale(Boston)

K-means clustering

#determining the number of clusters
wss <- (nrow(scaled_boston)-1)*sum(apply(scaled_boston,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(scaled_boston, 
    centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

km<- kmeans(scaled_boston,3)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       42    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
km1 <- kmeans(scaled_boston, 5)
summary(km1)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       70    -none- numeric
## totss          1    -none- numeric
## withinss       5    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           5    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

3 clusters would be appropriate.

pairs(scaled_boston, col=km$cluster)

BONUS

data("Boston")
scaled_boston <- scale(Boston)
dim(scaled_boston)
## [1] 506  14
km_bonus <- kmeans(scaled_boston, 4)
class(scaled_boston)
## [1] "matrix"
scaled_boston <- as.data.frame(scaled_boston)
class(scaled_boston)
## [1] "data.frame"
lda2 <- lda(km_bonus$cluster~., data = scaled_boston)
lda2
## Call:
## lda(km_bonus$cluster ~ ., data = scaled_boston)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.1047431 0.4031621 0.2529644 0.2391304 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1  1.9480908 -0.4872402  1.0149946 -0.272329068  1.0357640 -0.6389486
## 2 -0.3883281 -0.3346480 -0.4362541 -0.002135914 -0.4591933 -0.2307927
## 3  0.1886274 -0.4872402  1.2111512  0.127532675  1.0860073 -0.3665114
## 4 -0.3981339  1.2930469 -0.9902994 -0.012024920 -0.8283387  1.0566896
##          age        dis        rad        tax     ptratio       black
## 1  0.8260197 -0.9301775  1.6596029  1.5294129  0.80577843 -2.13287198
## 2 -0.1783550  0.2117619 -0.5770929 -0.6174348  0.09480641  0.31558148
## 3  0.8044599 -0.8146389  0.7955609  0.9872011  0.34603822  0.04414302
## 4 -0.9121115  0.9121798 -0.5955687 -0.6732556 -0.87884014  0.35548171
##        lstat        medv
## 1  1.4464782 -1.18461968
## 2 -0.1423656 -0.09135041
## 3  0.5301732 -0.40331922
## 4 -0.9544043  1.09954700
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## crim    -0.31028992  0.62224451  0.597212526
## zn      -0.09338930  0.96434025 -0.716986122
## indus   -1.40992137 -0.16185691 -1.004128879
## chas     0.06950551 -0.12316233  0.073570823
## nox     -0.48313830  0.10654440 -0.870736311
## rm       0.25553149  0.52639418 -0.282378921
## age     -0.07134757 -0.31940794 -0.045053040
## dis      0.06345902  0.01924507 -0.380206568
## rad     -0.85654373  0.09761292  0.003307117
## tax     -0.37874646  0.14644700 -0.545068105
## ptratio -0.01835256 -0.12874013  0.128579267
## black    0.56785879 -0.68819820 -0.950512993
## lstat   -0.09283319  0.55997738  0.261272880
## medv     0.20180072  0.60601649 -0.347080865
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7798 0.1174 0.1029
classes1 <- as.numeric(km_bonus$cluster)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda2, dimen = 2, col = classes1, pch = classes1)
lda.arrows(lda2, myscale = 1)

##superbonus

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda1$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda1$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_set$crime)